knitr::opts_chunk$set( collapse = TRUE, fig.width = 8, fig.height = 6, fig.align = "center", comment = "#>", message = FALSE, warning = FALSE ) options(rmarkdown.html_vignette.check_title = FALSE)
This vignette guides you through the new functionality that turns a mrIML
multi-response model into a graphical network model (GNM). There are also new functions in mrIML (>= 2.0.0)
that can be used in non-GNM models, such as for single-taxon SDMs, epi models, gene-environment association (GEA) studies, and landscape genetics. Advances in Tidymodels are at the core of these functions, as are exciting developments in interpretable machine learning (e.g., for quickly quantifying interactions through H-statistics). The capability to capture uncertainty via bootstraps and spatial patterns are also features of this mrIML
module. The examples come from the microbiome world but are, of course, generalizable to any multi-response problem.
Currently, the models are optimized and tested on presence/absence data. Models suitable for abundance data will be added in the future.
To demonstrate, first, we need to load some packages and data. The first example uses coinfection data from New Caledonian Zosterops species. A single continuous covariate is also included (scale.prop.zos
), which reflects the relative abundance of Zosterops species among different sample sites.
library(mrIML) library(tidymodels) library(flashlight) library(ggplot2) library(patchwork) library(igraph) library(ggnetwork) library(network) set.seed(7007)
data <- MRFcov::Bird.parasites Y <- data %>% dplyr::select(-scale.prop.zos) %>% dplyr::select(order(everything())) X <- data %>% dplyr::select( scale.prop.zos ) X1 <- Y
Note that the inclusion of X1
converts mrIML
into a Joint Species Distribution Model (JSDM) by directly adding the presence/absence patterns of the other taxa into the model as predictors (along with other environmental/host covariates). Note that the order of X1
and Y
needs to match.
Built into the mrIML
architecture (and a big advantage of tidymodels
) is the capability to change the underlying model easily. We are going to set up two models to compare: a random forest model (RF) and logistic regression (LM). mrIML
takes advantage of multi-core processing, so we set it up here to run on 5 cores. These steps are the same as in mrIML (< 2.0.0)
.
future::plan("multisession", workers = 5)
model_rf <-rand_forest( trees = 100, mode = "classification", mtry = tune(), min_n = tune() ) %>% set_engine("randomForest") model_lm <- logistic_reg() %>% set_engine("glm") %>% set_mode("classification")
Aside from adding JSDM functionality with the X1
call, we have also enabled mrIML
to tune hyperparameters using the very efficient racing
option (see Kuhn (2014)). In brief, this racing option takes a small subsample of parameters and eliminates combinations that do not improve fit using a repeated-measures ANOVA model with finetune::tune_race_anova()
. Setting racing = FALSE
reverts to a grid search if you want to manually set the tuning grid size (i.e., using tune::tune_grid()
). For logistic regression, there are no parameters to tune, so set racing = FALSE
.
#random forest yhats_rf <- mrIMLpredicts( X = X, Y = Y, X1 = Y, Model = model_rf, prop = 0.7, k = 5, racing = TRUE ) #linear model yhats_lm <- mrIMLpredicts( X = X, Y = Y, X1 = Y, Model = model_lm , balance_data = "no", prop = 0.6, k = 5, racing = FALSE )
It's important to compare whether there are any advantages to using a random forest approach. Interpretation would be easier overall if logistic regression gave similar predictive performance.
ModelPerf_rf <- mrIMLperformance(yhats_rf) ModelPerf_rf[[1]] #across all parasites ModelPerf_rf[[2]] #overall ModelPerf_lm <- mrIMLperformance(yhats_lm) ModelPerf_lm[[1]] ModelPerf_lm[[2]] plots <- mrPerformancePlot( ModelPerf1 = ModelPerf_lm, ModelPerf2 = ModelPerf_rf, mode = "classification" ) plots[[1]] / plots[[2]] plots[[3]]
If we just look at the overall AUC values, model performance appears quite similar (r mean(ModelPerf_rf[[1]]$roc_AUC) %>% round(digits = 2)
for the RF and r mean(ModelPerf_lm[[1]]$roc_AUC) %>% round(digits = 2)
for the LM). However, when we look at the Matthews correlation coefficient (MCC) for each taxon in the LM model, we see that for H.killangoi and Plas (Plasmodium), the values are much lower (an overall MCC of r mean(ModelPerf_lm[[2]]) %>% round(digits = 2)
for LM is basically just a guess, compared to r mean(ModelPerf_rf[[2]]) %>% round(digits = 2)
for the RF). Remember, the classes are imbalanced, so AUC tends to be an inflated measure. This provides evidence that non-linear relationships can improve predictive performance.
Next, we ask whether including putative associations between taxa improves model performance, or whether the relationship between parasites and host relative abundance alone is sufficient (i.e., assuming independence between the parasites).
yhats_rf_noAssoc <- mrIMLpredicts( X = X, Y = Y, Model = model_rf, prop = 0.7, k = 5, racing = TRUE ) ModelPerf_rf_noAssoc <- mrIMLperformance(yhats_rf_noAssoc) ModelPerf_rf_noAssoc[[1]] ModelPerf_rf[[1]] #performance including associations
You can see that overall, including associations improves model performance, particularly for predicting H.killangoi and Microfilaria. Using MCC to compare models is problematic because, in the association-free model, it's undefined for these taxa (NA
values arise because there are no false negatives for low-prevalence taxa). Positive predictive value (PPV) is useful in this case and shows that without associations, we can't predict the occurrence of these taxa (PPV = 0 for both). Including associations increases PPV to \~0.2—not great, as 80% of our positive predictions for these taxa are false.
Including associations makes a difference, but how can we improve predictions for our two rarer taxa? Upsampling is possible, but in this case, we'll try downsampling to see if correcting class imbalance improves model fit.
yhats_rf_downSamp <- mrIMLpredicts( X = X, Y = Y, X1 = Y, Model = model_rf, balance_data = "down", #down sampling prop = 0.75, k = 5, racing = TRUE ) ModelPerf_rf_downSamp <- mrIMLperformance(yhats_rf_downSamp) ModelPerf_rf_downSamp[[1]]
Look at those PPV values now—much better. Our false positive rate is down to <6% overall. Now that we are happy with the performance of the model, we can interrogate it further.
In many cases, like this dataset, community or microbiome data tend to be small in size. Applying stochastic machine learning algorithms to such data can lead to challenges. For instance, variable importance may vary substantially when we build multiple models using the same data and algorithm. To handle this variability and better understand prediction uncertainty, mrIML
now includes functionality to capture uncertainty in the tuned model using bootstrapping. Additionally, this approach helps estimate how variables affect the response, and these estimates align with those from traditional linear regression models (see Cook et al., 2021).
mrIML
makes it easy to obtain bootstrap estimates for a variety of interpretable machine learning tools and uses these estimates to construct marginalized co-occurrence networks. First, let's perform bootstrapping and calculate variable importance.
bs_malaria <- mrBootstrap( yhats_rf, num_bootstrap = 100, downsample = TRUE ) bs_impVI <- mrVip( mrIMLobj = yhats_rf, mrBootstrap_obj = bs_malaria, model_perf = ModelPerf_rf ) bs_impVI[[3]]
You can see that host abundance is the most important predictor of this parasite community (followed by H.zosteropsis). However, the sub-figure on the right shows substantial variability. For example, H.zosteropsis is the most important predictor for the occurrence of Microfilaria, and host abundance (shortened to "sc..") is less important.
To examine the relationship between each variable and community structure, mrIML
has a convenient wrapper to plot bootstrapped partial dependencies for a taxon of interest.
pds <- mrPdPlotBootstrap( yhats_rf, mrBootstrap_obj=bs_malaria, vi_obj=bs_impVI, target='Plas', global_top_var=5 ) pds[[2]]
These plots show that the presence of Microfilaria greatly increases the probability of observing Plasmodium (from \~0.47 to 0.70 while holding all other variables at their mean value). Note that these are marginal relationships (i.e., isolating the effect of each predictor). When host abundance is high, the probability of detecting Plasmodium decreases non-linearly (with a threshold around \~0.5). The other parasites have less effect.
To explore the effect of host abundance overall, we can use the mrCovar()
function.
covar <- mrCovar( yhats_rf, var = "scale.prop.zos", sdthresh = 0.01 ) covar[[1]] / covar[[2]] / covar[[3]]
Note that this isn't bootstrapped—each line represents a taxon. The second plot shows the community-wide average effect with taxon-specific effects silhouetted in the background. The third plot shows the community-wide average change in occurrence probabilities across host abundance. Note that the occurrence probabilities of all taxa drop at intermediate levels of host abundance (0.75–1.25).
We can utilize all the bootstrapped partial dependence estimates (PDs) to construct a co-occurrence network. The following code converts the object to an igraph
object and plots it. This is a directed network, and edges are scaled by the standard deviation of the marginal change in prediction (along the PD curves). Red edges are positive associations (i.e., predicted occurrence increases with the presence of the other taxon), and blue edges are negative.
assoc_net<- mrCoOccurNet(bs_malaria) # Based on simulations (not shown here), we use # the following rule of thumb for associations: # Any association > 0.05 for mean strength is # included assoc_net_filtered <- assoc_net %>% filter(mean_strength > 0.05) # Convert mrIML to igraph g <- graph_from_data_frame( assoc_net_filtered, directed = TRUE, vertices = names(Y) ) E(g)$Value <- assoc_net_filtered$mean_strength E(g)$Color <- ifelse( assoc_net_filtered$direction == "negative", "blue", "red" ) # Convert the igraph to a ggplot with NMDS layout gg <- ggnetwork(g) # Plot the graph gg %>% ggplot( aes( x = x, y = y, xend = xend, yend = yend ) ) + geom_edges( aes(color = Color, linewidth = Value), curvature = 0.2, arrow = arrow( length = unit(5, "pt"), type = "closed" ) ) + geom_nodes( color = "gray", size = degree(g, mode = "out")/2 ) + scale_color_identity() + theme_void() + theme(legend.position = "none") + geom_nodelabel_repel( aes(label = name), box.padding = unit(0.5, "lines"), data = gg, size = 3, segment.colour = "black", colour = "white", fill = "grey36" ) + coord_cartesian(xlim = c(-0.2, 1.2), ylim = c(-0.2, 1.2))
Finally, we can quantify the overall importance of interactions in different taxa models, as well as the importance of one- and two-way interactions for a specific taxon using the same bootstrapping approach. To quantify interaction importance, we use H-statistics from the hstats
package.
int_ <- mrInteractions( yhats_rf, num_bootstrap = 100, feature = "Plas", top_int = 10 ) int_[[1]] / # overall plot int_[[2]] / # individual plot for feature int_[[3]] # two way plot for feature
The first plot shows that interactions account for an average of 60% (with a 90% bootstrap uncertainty interval of 44–81%) of variation in predictions for H.killangoi, and less for the other taxa. The second plot shows that interactions involving host abundance most affect predictions of Plasmodium (although H.zosteropis is also important and, according to the uncertainty intervals, could be even more important). This trend is similar community-wide. The last plot shows that the interaction between Haemoproteus presence and host abundance is the most important two-way interaction for Plasmodium, but this isn’t true community-wide, as host abundance and H.zosteropis form the strongest interaction overall. Taken together, we can see that interactions between taxa are mediated by host abundance.
Finally, we can explore specific interactions in more detail using 2D partial dependence plots implemented in the flashlight
package. Below, we plot how the two-way interaction between host abundance and H.zosteropis impacts the probability of detecting Plasmodium.
fl <- mrFlashlight( yhats_rf_downSamp, response = "single", index = 4 #index=4 selects Plas ) fl %>% light_profile2d( c("scale.prop.zos", "Hzosteropis"), data ) %>% plot() + theme_bw()
So, if H.zosteropsis is not present and the relative abundance of Zosterops species is low, the probability of observing Plasmodium is high (>0.7).
Cook et al., 2021: https://doi.org/10.18651/RWP2021-12 Kuhn (2014): https://doi.org/10.48550/arXiv.1405.6974 Fountain-Jones et al(2021): https://doi.org/10.1111/1755-0998.13495
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.